Q1 (13 pts)
a) Find the word or phrase from the table below that best matches the description. (4pts)

a. | Cache | Small, fast memory that acts as a buffer for the main memory.
b. | ISA | Specific interface that the hardware provides to the low-level software.
c. | Assembler | Program that converts a symbolic version of an instruction into the binary version.
d. | CPU | Component of the computer where all running programs and associated data reside.
e. | Instruction | Single software command to a processor.
f. | MIPS | Sometimes used as a performance metric.
g. | Miss Penalty | The time to replace a block in the upper level plus the time to deliver the data to the processor.
h. | Hit Time | The time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss.


b) What are the advantages of the IEEE-754 standard for floating-point numbers? (3pts)

Simplified representation of floating-point numbers. Unified the algorithms for floating-point arithmetic. Increased the accuracy of floating-point numbers. The encoding of the exponent and fraction simplifies comparison: an integer comparator can be used to compare the magnitudes of FP numbers.

Includes special exceptional values: NaN and ±∞. Special rules are used, such as: 0/0 is NaN, sqrt(-1) is NaN, 1/0 is ∞, and 1/∞ is 0. Computation may continue in the face of exceptional conditions.

Denormalized numbers fill the gap between the smallest normalized number, 1.0 × 2^Emin, and zero. Denormalized numbers, with values 0.F × 2^Emin, are closer to zero.
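
The points above can be checked with a short Python sketch. It assumes single precision (Emin = -126) for the bit patterns, and the helper names `float_bits` and `ordered_like_integers` are made up for this illustration. Python raises an exception for `1.0 / 0.0` instead of returning ∞, so the sketch starts from an infinity value directly.

```python
import math
import struct

def float_bits(x: float) -> int:
    """Return the 32-bit single-precision encoding of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def ordered_like_integers(a: float, b: float) -> bool:
    """For non-negative floats, value order matches the integer order of the bits."""
    return (a < b) == (float_bits(a) < float_bits(b))

inf = float("inf")
nan_from_inf = inf - inf        # invalid operation produces NaN, computation continues
zero_from_inf = 1.0 / inf       # 1/inf is 0

smallest_normal = 2.0 ** -126   # 1.0 x 2^Emin in single precision
smallest_denormal = 2.0 ** -149 # 0.F x 2^Emin with only the last fraction bit set

print(math.isnan(nan_from_inf), zero_from_inf)
print(hex(float_bits(smallest_normal)), hex(float_bits(smallest_denormal)))
print(ordered_like_integers(smallest_denormal, smallest_normal))
```

The denormalized value has an all-zero exponent field, so it sits between zero and the smallest normalized number in both value order and bit-pattern order.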


c) Explain the replacement and write policies in the cache memory. (3pts) 


Replacement:

• Random replacement: Candidate blocks are randomly selected. One counter for all sets (0 to m - 1): incremented on every cycle. On a cache miss, replace the block specified by the counter.
• First In First Out (FIFO) replacement: Replace the oldest block in the set. One counter per set (0 to m - 1): specifies the oldest block to replace. The counter is incremented on a cache miss.
• Least Recently Used (LRU): Replace the block that has been unused for the longest time. Order the blocks within a set from least to most recently used. Update the ordering of the blocks on each cache hit. With m blocks per set, there are m! possible permutations.

Write:

o Write through - write to memory, stall the processor until done.
o Write back - delay the write to memory until the block is replaced in the cache.
o Write buffer - place the write in a buffer. Used in pipelines; allows the pipeline to continue.


d) What is the difference between signed and unsigned arithmetic instructions in the MIPS processor? (3pts)

The difference between signed and unsigned arithmetic instructions in the MIPS processor is whether a trap is executed on overflow (add instructions) or the overflow is ignored (add unsigned instructions).


Q2 (17 pts) 


a) The following steps are used to multiply two 5-bit signed numbers.
Which algorithm was used to perform this multiplication? (2pts)
Complete the steps in the table and find the values of the Multiplier, Multiplicand and Product in decimal. (7pts)

Step | Operation
1    | Shift Right
2    | Shift Right
3    | Subtract
     | Shift Right
4    | Add
     | Shift Right
5    | Subtract (result after subtraction: 00011000010)

The algorithm is Booth's Algorithm for signed multiplication.
It works for two's complement numbers. Key idea: test 2 bits of the multiplier at once.
10 - subtract (beginning of a run of 1's)
01 - add (end of a run of 1's)
00, 11 - do nothing (middle of a run of 0's or 1's)




Tripoli University Computer Architecture (EE434) 
EEE-Department Fall-2012 


The numbers are signed 5-bit values, so the product register holds 10 product bits plus one extra bit.
The result at step 5 after subtraction is 00011000010.
The last step is a shift right after the subtraction: 00001100001.
Product (PR) = 0000110000 = 48 (decimal)


Multiplier (MR):
In step 5, Subtract → 00011000010 (10 = MR4 & MR3)

MR4 | MR3 | MR2 | MR1 | MR0
 1  |  0  |  ?  |  ?  |  ?

In step 4, Add (01) → (01 = MR3 & MR2)

MR4 | MR3 | MR2 | MR1 | MR0
 1  |  0  |  1  |  ?  |  ?

In step 3, Subtract (10) → (10 = MR2 & MR1)

MR4 | MR3 | MR2 | MR1 | MR0
 1  |  0  |  1  |  0  |  ?

In step 2, Shift Right (00 or 11) → because step 3 is a subtraction, the two bits must be 00, not 11 → (00 = MR1 & MR0)

MR4 | MR3 | MR2 | MR1 | MR0
 1  |  0  |  1  |  0  |  0

MR = 10100 = -12 (decimal)
MC = PR/MR = 48 / (-12) = -4 (decimal) = 11100
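
The trace can be replayed with a small Python sketch of Booth's algorithm. The function name `booth_multiply` and the trace format are assumptions made for this illustration. The register is modelled as the upper product half, the multiplier, and one extra bit, which is 11 bits for 5-bit operands.

```python
def booth_multiply(multiplicand: int, multiplier: int, bits: int = 5):
    """Booth's algorithm on `bits`-bit two's complement operands.

    Returns (product, trace); trace lists (step, operation, register bits)
    after every add/subtract and after every shift.
    """
    mask = (1 << bits) - 1
    width = 2 * bits + 1                       # product register plus the extra bit
    low_mask = (1 << (bits + 1)) - 1           # multiplier half plus the extra bit
    reg = (multiplier & mask) << 1             # upper half 0, multiplier, extra bit 0
    trace = []
    for step in range(1, bits + 1):
        pair = reg & 0b11                      # (current multiplier bit, previous bit)
        upper = reg >> (bits + 1)
        op = None
        if pair == 0b10:                       # beginning of a run of 1's
            upper = (upper - multiplicand) & mask
            op = "Subtract"
        elif pair == 0b01:                     # end of a run of 1's
            upper = (upper + multiplicand) & mask
            op = "Add"
        reg = (upper << (bits + 1)) | (reg & low_mask)
        if op:
            trace.append((step, op, format(reg, f"0{width}b")))
        sign = reg >> (width - 1)              # arithmetic shift right keeps the sign bit
        reg = (reg >> 1) | (sign << (width - 1))
        trace.append((step, "Shift Right", format(reg, f"0{width}b")))
    product = reg >> 1                         # drop the extra bit
    if product >> (2 * bits - 1):              # reinterpret as a signed 2*bits-bit value
        product -= 1 << (2 * bits)
    return product, trace

product, trace = booth_multiply(-4, -12)
for entry in trace:
    print(entry)
print("product =", product)
```

With multiplicand -4 and multiplier -12 the sketch reproduces the operation sequence of the table (two shifts, then subtract, add, subtract, each followed by a shift) and the register value 00011000010 after the step 5 subtraction.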


b) Calculate the following half-precision floating-point arithmetic operations. (8pts)

Half-precision format (1 bit for sign, 5 bits for exponent, and 10 bits for fraction)

i.
  0111111110000010
- 0000111011000000

0 11111 1110000010
S = 0
E = 31
Fraction is non-zero
∴ This number is a special case (NaN).
NaN (operation) any number = NaN.

ii.
  0111111000000001
× 1000111011000000

0 11111 1000000001
S = 0
E = 31
Fraction is non-zero
∴ This number is a special case (NaN).
NaN (operation) any number = NaN.
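
The field split used in both answers can be sketched in Python as follows. The function name `classify_half` and its return strings are assumptions made for this illustration; the field widths are the ones stated in the question.

```python
def classify_half(bits16: int) -> str:
    """Classify a 16-bit half-precision pattern (1 sign, 5 exponent, 10 fraction bits)."""
    sign = (bits16 >> 15) & 0x1
    exponent = (bits16 >> 10) & 0x1F
    fraction = bits16 & 0x3FF
    if exponent == 0x1F:                       # all-ones exponent: special values
        if fraction != 0:
            return "NaN"
        return "-Infinity" if sign else "+Infinity"
    if exponent == 0:                          # all-zeros exponent: zero or denormalized
        return "zero" if fraction == 0 else "denormalized"
    return "normalized"

for pattern in (0b0111111110000010, 0b0000111011000000,
                0b0111111000000001, 0b1000111011000000):
    print(format(pattern, "016b"), classify_half(pattern))
```

In both operations the first operand is classified as NaN, so the result is NaN regardless of the second operand.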


Q3 (20 pts) 


a) Write a sequence of MIPS instructions which can discover whether or not there is an overflow in signed addition. (4pts)

addu $t0,$t1,$t2            # $t0 = sum (addu, so the addition itself does not trap)
xor  $t3,$t1,$t2            # Check if signs differ
slt  $t3,$t3,$zero          # $t3 = 1 if signs differ
bne  $t3,$zero,No_overflow  # $t1, $t2 signs ≠, so no overflow
xor  $t3,$t0,$t1            # signs =; sign of sum match too?
                            # $t3 negative if sum sign different
slt  $t3,$t3,$zero          # $t3 = 1 if sum sign different
bne  $t3,$zero,Overflow     # All three signs ≠; go to overflow
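
The same sign test can be mirrored in Python to try it on concrete operands. This is only a sketch: the function name `add_overflows` is an assumption, and registers are modelled as unsigned 32-bit integers.

```python
def add_overflows(a: int, b: int) -> bool:
    """Return True if the 32-bit signed addition a + b overflows.

    a and b are 32-bit register contents given as unsigned integers.
    """
    total = (a + b) & 0xFFFFFFFF          # wrap-around sum, no trap
    if ((a ^ b) >> 31) & 1:               # operand signs differ: overflow is impossible
        return False
    return bool(((total ^ a) >> 31) & 1)  # same operand signs, different sum sign

print(add_overflows(0x7FFFFFFF, 1))           # largest positive + 1
print(add_overflows(0x80000000, 0xFFFFFFFF))  # most negative + (-1)
print(add_overflows(1, 0xFFFFFFFF))           # 1 + (-1)
```

The two XOR operations play the same role as in the instruction sequence: the first compares the operand signs, the second compares the sum sign with an operand sign.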


b) Two different compilers are being tested on the same program for a 6 GHz machine with three different classes of instructions: Class A (branch instructions), Class B (load instructions), and Class C (other instructions), which require 3, 5, and 1 cycles, respectively. The instructions produced by the first and the second compiler are shown below.

Compiler 1                      | Compiler 2
      lui  $s0,0x100            |       lui  $s0,0x100
      ori  $s0,$s0,0x80fc       |       ori  $s0,$s0,0x80fc
      ori  $s6,$0,0             |       ori  $s1,$0,255
      ori  $t0,$0,0             |       or   $s2,$0,$0
next: lbu  $s5,0($s0)           |       or   $s6,$0,$0
      add  $s6,$s6,$s5          |       or   $s4,$0,$0
      addi $s0,$s0,1            |       ori  $t0,$0,3
      addi $t0,$t0,1            |       lw   $s5,0($s0)
      slti $t1,$t0,3            | next: and  $s3,$s5,$s1
      bne  $t1,$0,next          |       srlv $s3,$s3,$s4
      div  $s6,$t0              |       add  $s6,$s6,$s3
      mflo $s6                  |       addi $s4,$s4,8
      mfhi $s7                  |       sll  $s1,$s1,8
                                |       slti $t1,$s4,24
                                |       bne  $t1,$0,next
                                |       div  $s6,$t0
                                |       mflo $s6
                                |       mfhi $s7

The function of this program is to calculate the mean (average) of 3 numbers, which are stored in big-endian order in memory.





ADDRESS DATA 
0x010080FC | 0x19230DF7 
0x01008100 | 0xF5CE67A3 
0x01008104 | 0x112BC49A 























i. What is the content of registers $s6 and $s7 in decimal after executing this program? (3pts)
The content of $s6 and $s7:
The function of the program is to calculate the mean of 3 numbers.
Let us track the instructions produced by compiler 1.
$s0 = 0x010080FC → the base address.
Memory byte order is big-endian → [0x010080FC] = 0x19
[0x010080FD] = 0x23
[0x010080FE] = 0x0D
The LBU instruction is used to load one byte from memory into register $s5.
$s6 initially = 0
First loop: $s6 = $s6 + $s5 = 0x19
Next loop: $s6 = 0x19 + 0x23
Next loop: $s6 = 0x19 + 0x23 + 0x0D = 0x49 = 73 (decimal)
Finally $s6/$t0, where $t0 = 3 after the last loop.
The quotient will be in the LO register and the remainder in the HI register.
mflo $s6
Content of $s6 = 0x18 = 24 (decimal)
mfhi $s7
Content of $s7 = 0x1 = 1 (decimal)
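
The byte-by-byte trace of the compiler 1 code can be replayed with a short Python sketch. The function name `run_compiler1` is an assumption, and the memory word is modelled as a big-endian byte string taken from the first row of the memory table.

```python
def run_compiler1(word_hex: str = "19230DF7", count: int = 3):
    """Replay the compiler 1 loop: sum `count` bytes, then divide by the counter."""
    memory = bytes.fromhex(word_hex)   # bytes at 0x010080FC.. in big-endian order
    s6 = 0                             # running sum ($s6)
    t0 = 0                             # loop counter ($t0)
    while True:
        s5 = memory[t0]                # lbu: load one unsigned byte
        s6 += s5
        t0 += 1
        if not t0 < count:             # slti / bne: loop while $t0 < 3
            break
    lo, hi = divmod(s6, t0)            # div: LO = quotient, HI = remainder
    return s6, lo, hi

total, s6_final, s7_final = run_compiler1()
print(total, s6_final, s7_final)
```

The three bytes read are 0x19, 0x23 and 0x0D, in that order, because the lowest address holds the most significant byte of the word.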
ii. Which compiler produces a better execution time? (8pts)
Clock Rate = 6 GHz

Class                                   | CPI | Instruction Count (IC), Compiler 1 | Instruction Count (IC), Compiler 2
Class A (branch instructions)           | 3   | 3                                  | 3
Class B (load instructions: LBU and LW) | 5   | 3                                  | 1
Class C (other instructions)            | 1   | 7 + 4 × 3 = 19                     | 10 + 6 × 3 = 28
Total                                   |     | Total IC = 25                      | Total IC = 32

Compiler 1:
CPU Cycles (Compiler 1) = Σ CPI_i × IC_i = 3 × 3 + 5 × 3 + 1 × 19 = 43 cycles
CPU execution time = CPU Cycles / Clock Rate = 43 / 6 GHz = 7.17 ns

Compiler 2:
CPU Cycles (Compiler 2) = Σ CPI_i × IC_i = 3 × 3 + 5 × 1 + 1 × 28 = 42 cycles
CPU execution time = CPU Cycles / Clock Rate = 42 / 6 GHz = 7 ns

∴ Compiler 2 produces a better execution time.


iii. Which compiler produces a higher MIPS? (5pts)
MIPS = IC / (CPU execution time × 10^6)

Compiler 1: MIPS = 25 / (7.1667 ns × 10^6) = 3488.4 million instructions per second
Compiler 2: MIPS = 32 / (7 ns × 10^6) = 4571.4 million instructions per second
∴ Compiler 2 produces a higher MIPS.
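
The cycle, time and MIPS figures can be recomputed with a few lines of Python. The dictionary layout and the function name `metrics` are assumptions made for this sketch; the CPI values and instruction counts are the ones from the table in part ii.

```python
CLOCK_HZ = 6e9                                  # 6 GHz clock
CPI = {"A": 3, "B": 5, "C": 1}                  # cycles per instruction class
COUNTS = {
    "Compiler 1": {"A": 3, "B": 3, "C": 19},
    "Compiler 2": {"A": 3, "B": 1, "C": 28},
}

def metrics(ic_by_class):
    """Return (cycles, execution time in seconds, MIPS) for one instruction mix."""
    cycles = sum(CPI[cls] * n for cls, n in ic_by_class.items())
    time_s = cycles / CLOCK_HZ
    total_ic = sum(ic_by_class.values())
    mips = total_ic / (time_s * 1e6)
    return cycles, time_s, mips

for name, mix in COUNTS.items():
    cycles, time_s, mips = metrics(mix)
    print(f"{name}: {cycles} cycles, {time_s * 1e9:.3f} ns, {mips:.1f} MIPS")
```

Compiler 2 executes more instructions but needs fewer cycles, which is why it wins on both execution time and MIPS here.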


Q4 (25 pts) 
Consider the following single-cycle datapath for the MIPS processor implementing a subset of the instruction set (R-Type, Immediate Arithmetic and Logic, LW/SW, jump and branch instructions):

[Figure: single-cycle MIPS datapath. Labels recoverable from the figure: Jump or Branch Target Address (30 bits), Instruction Memory, Address, Data_out, and the control signals ALUSrc, ALUCtrl, RegWrite, ExtOp, Op, MemRead, MemtoReg.]

a) Suppose the ExtOp signal has a stuck-at-0 or stuck-at-1 fault. Which of the instructions mentioned above will not work correctly? Explain why.
(i) ExtOp stuck at 0. (5pts)
Immediate arithmetic (e.g. addi), LW and SW instructions will not work correctly, because they perform arithmetic on the 16-bit immediate [15:0], which must be sign-extended. With ExtOp stuck at 0 the immediate is always zero-extended, so negative immediates and offsets give wrong results.
(ii) ExtOp stuck at 1. (5pts)
Immediate logic instructions (e.g. ori) will not work correctly, because a logical operation needs the 16-bit immediate [15:0] to be zero-extended (zero padding). With ExtOp stuck at 1 an immediate whose bit 15 is 1 is sign-extended instead.
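
The effect of the two faults can be seen on a concrete immediate with a small Python sketch. The function name `extend16` is an assumption; as in the answer above, ExtOp = 1 is taken to mean sign extension and ExtOp = 0 zero extension.

```python
def extend16(imm16: int, ext_op: int) -> int:
    """Extend a 16-bit immediate to 32 bits: sign-extend if ext_op is 1, else zero-extend."""
    imm16 &= 0xFFFF
    if ext_op == 1 and imm16 & 0x8000:     # negative immediate: copy the sign bit upward
        return imm16 | 0xFFFF0000
    return imm16                           # zero padding of the upper 16 bits

negative_imm = 0xFFFC                      # -4 as a 16-bit two's complement value
print(hex(extend16(negative_imm, 1)))      # what addi/lw/sw need
print(hex(extend16(negative_imm, 0)))      # what ori needs
print(hex(extend16(0x0004, 1)), hex(extend16(0x0004, 0)))
```

Immediates with bit 15 equal to 0 extend to the same value either way, so each fault only shows up for immediates with bit 15 set.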


b) Suppose that the ALU has a shifter which is used for R-type shift instructions, and suppose that the content of $s0 = 0x010080CC. Explain how you can modify the datapath and control unit to implement the lui, lb and lbu instructions. Draw the modified datapath only and write all the control signal values as shown in the table below. (15pts)

lui rt, immediate
lb  $s1, imm($s0)     # lb rt, imm(rs)
lbu $s2, imm($s0)     # lbu rt, imm(rs)


Op  | PCSrc | RegDst | RegWrite | ExtOp | ALUSrc | ALUOp | Beq | Bne | J | MemRead | MemWrite | MemtoReg | LUI | EXTOp2 | LB/LBU
lui | 1     | 1      | 1        | x     | 0      | sll   | 0   | 0   | 0 | 0       | 0        | 1        | 1   | x      | 0
lb  | 1     | 1      | 1        | 1     | 0      | add   | 0   | 0   | 0 | 1       | 0        | 0        | x   | 1      | 1
lbu | 1     | 1      | 1        | 1     | 0      | add   | 0   | 0   | 0 | 1       | 0        | 0        | x   | 0      | 1

(LUI, EXTOp2 and LB/LBU are the additional control signals.)


















































c) Implement the add, xori and beq instructions on the datapath. Write all the control signal values as shown in the table below. (0pts)

add  rd, rs, rt
xori rt, rs, imm
beq  rs, rt, label

Op   | PCSrc | RegDst | RegWrite | ExtOp | ALUSrc | ALUOp | Beq | Bne | J | MemRead | MemWrite | MemtoReg
add  | 1     | 0      | 1        | x     | 1      | add   | 0   | 0   | 0 | 0       | 0        | 1
xori | 1     | 1      | 1        | 0     | 0      | xor   | 0   | 0   | 0 | 0       | 0        | 1
beq  | 0     | x      | 0        | x     | 1      | sub   | 1   | 0   | 0 | 0       | 0        | x
Q5 (25 pts) 


a) Consider a direct-mapped cache with a 2-bit offset, in which the total number of bits in each row is 42. Assume that the cache is initially empty.

i. Compute the total number of bits required to implement this cache. (3pts)
b = 2 bits → block size = 2^b = 2^2 = 4 bytes
Total number of bits in each row = 42 = valid bit + tag bits + block size in bits = 1 + tag bits + 4 × 8
42 = 1 + tag bits + 32 → tag bits = 42 - 33 = 9 bits
From the table, the reference address size is 16 bits → index bits = 16 - 9 - 2 = 5 bits
Number of blocks = 2^n = 2^5 = 32 blocks
Total number of bits required to implement this cache = total number of bits in each row × number of blocks = 42 × 32 = 1344 bits = 168 bytes.

ii. What is the size of this cache according to the number of bytes stored in the cache? (3pts)
Cache data size = 2^(n+b) = 2^(5+2) = 2^7 = 128 bytes.

iii. Determine whether each of the following addresses is a hit or a miss, and calculate the miss rate. (4pts)



2099, 2110, 1971, 1599, 2111, 1968, 1598, 1982, 2097, 2111, 1982, 2098

Byte Address (Decimal) | 16-bit Address (binary: tag index offset) | Tag (Decimal) | Index (Decimal) | Offset (Decimal) | Hit / Miss
2099 | 000010000 01100 11 | 16 | 12 | 3 | Miss
2110 | 000010000 01111 10 | 16 | 15 | 2 | Miss
1971 | 000001111 01100 11 | 15 | 12 | 3 | Miss
1599 | 000001100 01111 11 | 12 | 15 | 3 | Miss
2111 | 000010000 01111 11 | 16 | 15 | 3 | Miss
1968 | 000001111 01100 00 | 15 | 12 | 0 | Hit
1598 | 000001100 01111 10 | 12 | 15 | 2 | Miss
1982 | 000001111 01111 10 | 15 | 15 | 2 | Miss
2097 | 000010000 01100 01 | 16 | 12 | 1 | Miss
2111 | 000010000 01111 11 | 16 | 15 | 3 | Miss
1982 | 000001111 01111 10 | 15 | 15 | 2 | Miss
2098 | 000010000 01100 10 | 16 | 12 | 2 | Hit

Miss Rate (%) = 10/12 = 83.3 %
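
The hit/miss column can be reproduced with a short direct-mapped cache simulation in Python. The function name `simulate_direct_mapped` and the tuple layout are assumptions for this sketch; the field widths (2 offset bits, 5 index bits, 9 tag bits) come from part i.

```python
def simulate_direct_mapped(addresses, offset_bits=2, index_bits=5):
    """Return (address, tag, index, offset, outcome) for each access, cache initially empty."""
    lines = {}                                   # index -> tag currently stored
    results = []
    for addr in addresses:
        offset = addr & ((1 << offset_bits) - 1)
        index = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag = addr >> (offset_bits + index_bits)
        hit = lines.get(index) == tag            # valid entry with a matching tag
        lines[index] = tag                       # on a miss the block is replaced
        results.append((addr, tag, index, offset, "Hit" if hit else "Miss"))
    return results

refs = [2099, 2110, 1971, 1599, 2111, 1968, 1598, 1982, 2097, 2111, 1982, 2098]
rows = simulate_direct_mapped(refs)
misses = sum(1 for row in rows if row[4] == "Miss")
for row in rows:
    print(row)
print(f"miss rate = {misses}/{len(rows)}")
```

Only two rows (indexes 12 and 15) are ever used by these references, so blocks with different tags keep evicting each other, which explains the high miss rate.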











b) Consider a fully associative cache with four comparators and a cache data size of 32 bytes. Calculate the miss rate for these reference addresses. Assume that the cache is initially empty and that the replacement policy used in this cache is FIFO. (15 pts)

107, 126, 111, 86, 76, 70, 107, 74, 86, 107, 125, 86

Four comparators → m = 4 blocks.
Cache data size = 32 bytes = m × 2^b = 4 × 2^b → b = 3 bits → tag = 16 - 3 = 13 bits
The replacement policy is FIFO.
























































Byte Address (Decimal) | Address (binary: tag offset) | Tag (Decimal) | Offset (Decimal) | Hit / Miss
107 | 1101 011 | 13 | 3 | Miss
126 | 1111 110 | 15 | 6 | Miss
111 | 1101 111 | 13 | 7 | Hit
86  | 1010 110 | 10 | 6 | Miss
76  | 1001 100 | 9  | 4 | Miss
70  | 1000 110 | 8  | 6 | Miss
107 | 1101 011 | 13 | 3 | Miss
74  | 1001 010 | 9  | 2 | Hit
86  | 1010 110 | 10 | 6 | Hit
107 | 1101 011 | 13 | 3 | Hit
125 | 1111 101 | 15 | 5 | Miss
86  | 1010 110 | 10 | 6 | Miss

Miss Rate (%) = 8/12 = 66.667 %
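
The FIFO behaviour behind this table can be sketched in Python with a queue of tags. The function name `simulate_fifo` is an assumption; the block count (4) and offset width (3 bits) are the values derived above.

```python
from collections import deque

def simulate_fifo(addresses, num_blocks=4, offset_bits=3):
    """Fully associative cache with FIFO replacement; returns 'Hit'/'Miss' per access."""
    queue = deque()                      # tags in arrival order, oldest on the left
    outcomes = []
    for addr in addresses:
        tag = addr >> offset_bits
        if tag in queue:
            outcomes.append("Hit")       # a hit does not change the FIFO order
        else:
            outcomes.append("Miss")
            if len(queue) == num_blocks:
                queue.popleft()          # evict the oldest block
            queue.append(tag)
    return outcomes

refs = [107, 126, 111, 86, 76, 70, 107, 74, 86, 107, 125, 86]
outcomes = simulate_fifo(refs)
print(list(zip(refs, outcomes)))
print(f"miss rate = {outcomes.count('Miss')}/{len(outcomes)}")
```

The second access to 107 misses because tag 13 was the oldest block when 70 was loaded, even though it had been hit by 111 in between; that is the difference between FIFO and LRU.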
c) Consider three processors with different cache configurations. Cache 1: direct-mapped, cache data size 64 bytes with 16 blocks; instruction miss rate is 3%; data miss rate is 5%. Cache 2: direct-mapped with a 3-bit offset; instruction miss rate is 2%; data miss rate is 6%. Cache 3: two-way set associative, cache data size 64 bytes with a 3-bit index; instruction miss rate is 2%; data miss rate is 3%. For these processors, 65% of the instructions contain a data reference. Assume that the cache miss penalty is 8 + block size in bytes. The CPI for this workload was measured on a processor with cache 1 and was found to be 2.3.

i. Determine which processor spends the most cycles on cache misses. (4pts)

Cache 1 (direct mapped):
Cache data = 64 bytes. Number of blocks = 16 → 2^n = 16 → n = 4 bits
64 = 2^(n+b) = 2^(4+b) → b = 2 → block size = 2^2 = 4 bytes
Miss penalty = 8 + block size = 8 + 4 = 12

Cache 2 (direct mapped):
3-bit offset → block size = 2^b = 2^3 = 8 bytes
Miss penalty = 8 + block size = 8 + 8 = 16

Cache 3 (2-way set associative):
Cache data = 64 bytes. Index (n) = 3 bits
64 = m × 2^(n+b) = 2 × 2^(3+b) → b = 2 → block size = 2^2 = 4 bytes
Miss penalty = 8 + block size = 8 + 4 = 12














Memory Stall Cycles Per Instruction = I-Cache Miss Rate × Miss Penalty + LS Frequency × D-Cache Miss Rate × Miss Penalty
Combined Misses Per Instruction = I-Cache Miss Rate + LS Frequency × D-Cache Miss Rate
Stall Cycles Per Instruction = Combined Misses Per Instruction × Miss Penalty
65% of instructions contain a data reference → LS frequency = 0.65




















                                    | Cache 1                     | Cache 2                    | Cache 3
Combined Misses Per Instruction     | 0.03 + 0.65 × 0.05 = 0.0625 | 0.02 + 0.65 × 0.06 = 0.059 | 0.02 + 0.65 × 0.03 = 0.0395
Memory Stall Cycles Per Instruction | 0.0625 × 12 = 0.75          | 0.059 × 16 = 0.944         | 0.0395 × 12 = 0.474

∴ Cache 2 spends the most cycles on cache misses.








ii. The cycle time is 420 ps for the first and third processors and 310 ps for the second processor. Determine which processor is the fastest and which is the slowest. (4pts)

CPU Time = I-Count × CPI_MemoryStalls × Clock Cycle
CPI_MemoryStalls = CPI_PerfectCache + Mem Stalls per Instruction
CPI_MemoryStalls for Cache 1 is given = 2.3 cycles per instruction
CPI_PerfectCache = CPI_MemoryStalls (Cache 1) - Mem Stalls per Instruction (Cache 1)
CPI_PerfectCache = 2.3 - 0.75 = 1.55 cycles per instruction
CPI_MemoryStalls (Cache 2) = 1.55 + 0.944 = 2.494 cycles per instruction
CPI_MemoryStalls (Cache 3) = 1.55 + 0.474 = 2.024 cycles per instruction

CPU Time (Cache 1) = IC × 2.3 × 420 = 966 × IC ps
CPU Time (Cache 2) = IC × 2.494 × 310 = 773.14 × IC ps
CPU Time (Cache 3) = IC × 2.024 × 420 = 850.08 × IC ps
∴ The processor with cache 2 is the fastest and the processor with cache 1 is the slowest.

iii. By how much is the fastest processor faster than the slowest one? (2pts)
The fastest processor is faster than the slowest one by a factor of CPU Time (Cache 1) / CPU Time (Cache 2) = 966 / 773.14 ≈ 1.25.
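
The stall-cycle and CPU-time arithmetic for the three caches can be recomputed with a short Python sketch. The dictionary keys and the function name `stall_cycles` are assumptions for this illustration; times are in picoseconds per instruction, so the common instruction count IC is factored out.

```python
LS_FREQUENCY = 0.65             # fraction of instructions with a data reference
CPI_WITH_CACHE1 = 2.3           # measured CPI on the processor with cache 1

CACHES = {
    "Cache 1": {"i_miss": 0.03, "d_miss": 0.05, "block_bytes": 4, "cycle_ps": 420},
    "Cache 2": {"i_miss": 0.02, "d_miss": 0.06, "block_bytes": 8, "cycle_ps": 310},
    "Cache 3": {"i_miss": 0.02, "d_miss": 0.03, "block_bytes": 4, "cycle_ps": 420},
}

def stall_cycles(cache):
    """Memory stall cycles per instruction, with miss penalty = 8 + block size."""
    combined_misses = cache["i_miss"] + LS_FREQUENCY * cache["d_miss"]
    return combined_misses * (8 + cache["block_bytes"])

cpi_perfect = CPI_WITH_CACHE1 - stall_cycles(CACHES["Cache 1"])
time_ps = {name: (cpi_perfect + stall_cycles(c)) * c["cycle_ps"]
           for name, c in CACHES.items()}

fastest = min(time_ps, key=time_ps.get)
slowest = max(time_ps, key=time_ps.get)
print({name: round(t, 2) for name, t in time_ps.items()})
print(fastest, slowest, round(time_ps[slowest] / time_ps[fastest], 2))
```

Cache 2 has the most stall cycles per instruction but also the shortest cycle time, which is why it still ends up with the lowest CPU time.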




GOOD LUCK




